In these exercises, we will work with the dataset you preprocessed yesterday and perform three types of sentiment analysis.
Set up your R session and load your data so that we can perform sentiment analysis. Assign the loaded data to a dataframe named comments.
You should use the options() function to prevent R from interpreting your character variables as factor variables. If you are not sure how to use options(), you can always search for it in RStudio's help panel and have a look at all the available options. After setting the options, load your preprocessed dataset with readRDS(). Don’t forget to attach the necessary packages, syuzhet and sentimentr.
# Preventing R from interpreting characters as factors
options(stringsAsFactors = FALSE)
# attaching packages
library(syuzhet)
library(sentimentr)
# Loading data
comments <- readRDS("../data/ParsedComments.rds")
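If you want to see what options(stringsAsFactors = FALSE) and readRDS() do before touching the real data, the following minimal sketch round-trips a small data frame through a temporary .rds file. The object toy, its values, and the temporary file are made up for illustration only. Note that since R 4.0.0, stringsAsFactors = FALSE is already the default, so the options() call mainly matters on older installations.

```r
# Toy data frame standing in for the preprocessed comments (hypothetical values)
options(stringsAsFactors = FALSE)
toy <- data.frame(Text = c("great video", "not my thing"),
                  Likes = c(3, 0))

# Save to a temporary file and load it again, as with ParsedComments.rds
tmp <- tempfile(fileext = ".rds")
saveRDS(toy, tmp)
toy_loaded <- readRDS(tmp)

# Character columns stay character, and the round trip preserves the object
class(toy_loaded$Text)
identical(toy, toy_loaded)
```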
Choose the appropriate column from your comments dataframe to perform a basic sentiment analysis on. Which columns are suitable and which are not? Save the comment sentiments in a new variable called BasicSentimentSyu and check whether it contains any zero values. If there are zero values, why might this be the case?
Hyperlinks and emoji might cause problems for sentiment analysis (or for any text mining method, really). You can check whether a variable contains a given value x with table(variable == x), substituting your own variable name.
# Creating new column
BasicSentimentSyu <- get_sentiment(comments$TextEmojiDeleted)
# checking zero values
table(BasicSentimentSyu == 0)
##
## FALSE  TRUE
##  4630  1147
A score of zero is assigned to comments that contain no words from the dictionary, or that contain several words whose sentiment scores cancel each other out exactly.
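Both reasons for a zero score can be reproduced with a toy example. This is a minimal sketch in base R that assumes a four-word dictionary and a simple sum over matched words. The names toy_dict, toy_score, and toy_comments are hypothetical, and the scoring is a simplification rather than the syuzhet implementation.

```r
# Hypothetical mini-dictionary; real lexicons are far larger
toy_dict <- c(good = 1, great = 1, bad = -1, awful = -1)

# Sum the dictionary values of all words in a comment (unknown words add nothing)
toy_score <- function(text) {
  words <- strsplit(tolower(text), "[^a-z]+")[[1]]
  sum(toy_dict[words], na.rm = TRUE)
}

toy_comments <- c("The plot was good but the ending was bad",  # +1 and -1 cancel
                  "https://example.com",                        # no dictionary words
                  "Great soundtrack")                           # one positive word

toy_scores <- sapply(toy_comments, toy_score, USE.NAMES = FALSE)

# Same check as above: how many comments score exactly zero?
table(toy_scores == 0)
```

The first two comments end up at zero for different reasons, which is why a zero score should not be read as "neutral" without looking at the text.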
Check the documentation for the syuzhet package and the get_sentiment() function to see which dictionaries are available. Create a correlation matrix for sentiment scores using the different methods (you can leave out stanford). Which factors might result in low correlations between the dictionaries? Which one is the best to use?
You can find the documentation for the get_sentiment() function by searching for its name in the RStudio help panel or by entering ?get_sentiment in your console. You can also search online for further information. A correlation matrix can be created with the cor() function. As cor() needs all scores in a single object, such as a dataframe, you need to create one variable for each sentiment dictionary rating and combine them into a dataframe with cbind.data.frame() before passing it to cor().
Which dictionary is best always depends on your research question and what kind of data you want to use. In general, you should pick a dictionary that is as similar to your data as possible and is most sensitive to the kind of sentiment that you are interested in (dictionaries sometimes contain mainly positive or mainly negative entries).
# computing sentiment scores with different dictionaries
BasicSentimentSyu <- get_sentiment(comments$TextEmojiDeleted, method = "syuzhet")
BasicSentimentBing <- get_sentiment(comments$TextEmojiDeleted, method = "bing")
BasicSentimentAfinn <- get_sentiment(comments$TextEmojiDeleted, method = "afinn")
BasicSentimentNRC <- get_sentiment(comments$TextEmojiDeleted, method = "nrc")
# combining them to a dataframe
Sentiments <- cbind.data.frame(BasicSentimentSyu,
                               BasicSentimentBing,
                               BasicSentimentAfinn,
                               BasicSentimentNRC)
# setting colnames
colnames(Sentiments) <- c("Syuzhet",
                          "Bing",
                          "Afinn",
                          "NRC")
# Correlation Matrix
cor(Sentiments)
##           Syuzhet      Bing     Afinn       NRC
## Syuzhet 1.0000000 0.7491930 0.7413473 0.6519061
## Bing    0.7491930 1.0000000 0.6705407 0.4636890
## Afinn   0.7413473 0.6705407 1.0000000 0.4323534
## NRC     0.6519061 0.4636890 0.4323534 1.0000000
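When comparing dictionaries, it can also help to look at rank correlations, because a few comments with extreme totals can pull the default Pearson correlation down. The sketch below uses made-up scores for two hypothetical dictionaries (DictA and DictB), not values from the comments data, and simply contrasts the two methods that cor() offers.

```r
# Made-up scores for six comments under two hypothetical dictionaries
toy_sentiments <- data.frame(
  DictA = c(-1.5, -0.5, 0, 0.25, 1, 6),  # the last comment has an extreme total
  DictB = c(-2, -1, 0, 1, 1, 2)
)

# Pearson (the default) is sensitive to the extreme value
pearson <- cor(toy_sentiments)
pearson

# Spearman only compares the ranks of the scores
spearman <- cor(toy_sentiments, method = "spearman")
spearman
```

If the two matrices differ a lot on your real data, that is a hint that the disagreement between dictionaries is concentrated in a small number of comments.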
Standardize the comment sentiments for the syuzhet method with respect to the total number of words in the respective comment. Call this new variable SentimentPerWord.
Computing the number of words requires multiple functions if you want to use base R. The strsplit() command splits a character string into multiple strings at a specific separator, for example a space (" "), and the unlist() command converts a list of values into a regular vector. The length() function counts the number of elements in a vector, and with the sapply() function you can apply a function to each element of a vector. With these tools, you can compute the number of words per comment.
# computing number of Words
Words <- sapply(comments$TextEmojiDeleted,
                function(x) length(unlist(strsplit(x, " "))),
                USE.NAMES = FALSE) # avoid using the full comment text as element names
# computing average sentiment per word
SentimentPerWord <- BasicSentimentSyu/Words
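Two edge cases are worth checking on a toy vector: repeated spaces inflate the word count when splitting on a single space, and empty comments lead to a division by zero. The following sketch shows one possible way to handle both. The names toy_text, toy_sentiment, and toy_per_word are hypothetical, and setting the ratio to NA for empty comments is an assumption about what you want, not the only option.

```r
toy_text <- c("I love  this song", "", "meh")

# Splitting on a single space counts the empty string between the double spaces
toy_words_naive <- sapply(toy_text,
                          function(x) length(unlist(strsplit(x, " "))),
                          USE.NAMES = FALSE)

# Trimming and splitting on runs of whitespace avoids the inflated count
toy_words <- lengths(strsplit(trimws(toy_text), "\\s+"))

# Hypothetical per-comment scores; use NA where a comment has no words
toy_sentiment <- c(0.75, 0, -0.25)
toy_per_word <- ifelse(toy_words > 0, toy_sentiment / toy_words, NA)
```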
Compute comment sentiments using the sentimentr package. Compare the average comment sentiment per word from the sentimentr package with the one we computed. Which one do you think is more trustworthy and why?
For a total sentiment score per comment, you first have to use the get_sentences() function and then apply the sentiment_by() function to the resulting sentences. To plot the two different scorings against each other, you first need to put them into the same dataframe with cbind.data.frame(). You can then use the ggplot2 package for plotting.
# computing sentiment scores
Sentences <- get_sentences(comments$TextEmojiDeleted)
SentDF <- sentiment_by(Sentences)
# show output
SentDF[1:3,c(2,3,4)]
##    word_count        sd ave_sentiment
## 1:         14        NA     0.0000000
## 2:         30 0.1872113     0.1444742
## 3:         93 0.2678392    -0.1413356
# Attaching ave_sentiment and SentimentPerWord to comments dataframe
comments <- cbind.data.frame(comments,
                             ave_sentiment = SentDF$ave_sentiment,
                             SentimentPerWord = SentimentPerWord)
# plotting SentimentPerWord vs. SentimentR
library(ggplot2)
ggplot(comments, aes(x = ave_sentiment, y = SentimentPerWord)) +
  geom_point(size = 0.5) +
  ggtitle("Basic Sentiment Scores vs. `SentimentR`") +
  xlab("SentimentR Score") +
  ylab("Syuzhet Score per Word") +
  geom_smooth(method = "lm", se = TRUE)
SentimentR is:
- better at dealing with negations
- better at detecting fixed expressions
- better at detecting adverbs
- better at detecting slang and abbreviations
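To see why negation handling matters, consider the toy comparison below. It is a deliberately simplified sketch, not the algorithm used by sentimentr: the mini-dictionary toy_dict, the negators list, and both scoring functions are hypothetical, and the rule only flips the sign of a word that directly follows a negator.

```r
# Hypothetical mini-dictionary and negator list, for illustration only
toy_dict <- c(good = 1, bad = -1, funny = 1)
negators <- c("not", "never", "no")

# Plain dictionary lookup: every matched word counts with its own sign
naive_score <- function(text) {
  words <- strsplit(tolower(text), "[^a-z]+")[[1]]
  sum(toy_dict[words], na.rm = TRUE)
}

# Same lookup, but flip the sign of a word directly preceded by a negator
negation_score <- function(text) {
  words <- strsplit(tolower(text), "[^a-z]+")[[1]]
  values <- unname(toy_dict[words])
  values[is.na(values)] <- 0
  negated <- c(FALSE, head(words, -1) %in% negators)[seq_along(words)]
  sum(ifelse(negated, -values, values))
}

naive_score("This is not good")     # counts "good" as positive
negation_score("This is not good")  # flips "good" to negative
```

A plain word-by-word lookup has no way to distinguish "good" from "not good", which is one reason the two scores in the plot can disagree.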
Load the emoji sentiment lexicon from the lexicon package and copy it to a new dataframe called EmojiSentiments. Change the formatting of the dictionary entries and/or our Emoji column so that they are in the same format and can be matched. You can use the name EmojiToks for an intermediary variable if you need to create one. Afterwards, transform the EmojiSentiment dataframe to a quanteda dictionary object with the as.dictionary() function. Finally, use the tokens_lookup() function to create a new variable for emoji sentiments called EmojiToksSent.
To see the lexicons in the lexicon package, you can run lexicon::available_data() to get an overview of all the available lexicons. The name of the emoji lexicon is “emojis_sentiment”. Lexicons can be accessed with the command lexicon::lexicon_name, using the name of the lexicon you want to select. You can use the paste0() and gsub() functions to bring the formatting of the emoji column in line with the dictionary. Keep in mind that a valid dictionary needs appropriate column names; you can look these up in the help section of the as.dictionary() function.
# attaching packages
library(quanteda)
library(qdapRegex)
# emoji Sentiments
EmojiSentiments <- lexicon::emojis_sentiment
EmojiSentiments[1:5,c(1,2,4)]
##               byte                            name sentiment
## 1 <f0><9f><98><80>                   grinning face 0.5717540
## 2 <f0><9f><98><81>  beaming face with smiling eyes 0.4499772
## 3 <f0><9f><98><82>          face with tears of joy 0.2209684
## 4 <f0><9f><98><83>     grinning face with big eyes 0.5580431
## 5 <f0><9f><98><84> grinning face with smiling eyes 0.4220315
# changing formatting in dictionary
EmojiNames <- paste0("emoji_", gsub(" ", "", EmojiSentiments$name))
EmojiSentiment <- cbind.data.frame(EmojiNames,
                                   EmojiSentiments$sentiment,
                                   EmojiSentiments$polarity)
# naming
names(EmojiSentiment) <- c("word","sentiment","valence")
# see results
EmojiSentiment[1:5,]
##                                word sentiment  valence
## 1                emoji_grinningface 0.5717540 positive
## 2  emoji_beamingfacewithsmilingeyes 0.4499772 positive
## 3          emoji_facewithtearsofjoy 0.2209684 positive
## 4     emoji_grinningfacewithbigeyes 0.5580431 positive
## 5 emoji_grinningfacewithsmilingeyes 0.4220315 positive
# we then tokenize the emoji-only column in our formatted dataframe
EmojiToks <- tokens(tolower(as.character(unlist(comments$Emoji))))
EmojiToks[130:131]
## tokens from 2 documents.
## text130 :
## [1] "emoji_facewithtearsofjoy" "emoji_facewithtearsofjoy"
##
## text131 :
## [1] "emoji_facewithtearsofjoy"
# Creating dictionary object
EmojiSentDict <- as.dictionary(EmojiSentiment[,1:2])
# Replacing Emoji with sentiment scores
EmojiToksSent <- tokens_lookup(x = EmojiToks,
                               dictionary = EmojiSentDict)
EmojiToksSent[130:131]
## tokens from 2 documents.
## text130 :
## [1] "0.220968403775133" "0.220968403775133"
##
## text131 :
## [1] "0.220968403775133"
Plot the distribution of EmojiToksSent.
You can use the simple hist() function to create a histogram. Keep in mind though that you need to transform the tokens object back to a regular numeric vector. You can do this with the unlist() and as.numeric() functions.
hist(as.numeric(unlist(EmojiToksSent)),
     main = "Distribution of Emoji Sentiment",
     xlab = "Emoji Sentiment")
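The histogram pools all emoji across all comments. If you would rather have one value per comment, the same unlist() and as.numeric() conversion can be applied comment by comment. The sketch below uses a plain list (toy_toks_sent, with made-up values) as a stand-in for the tokens object. It assumes you first convert EmojiToksSent with as.list(), and that comments without emoji should get NA.

```r
# Plain list standing in for as.list(EmojiToksSent) (hypothetical values):
# one character vector of looked-up scores per comment
toy_toks_sent <- list(text1 = c("0.22", "0.22"),
                      text2 = character(0),       # comment without emoji
                      text3 = c("0.57", "-0.1"))

# Flatten to one numeric vector, as needed for hist()
all_scores <- as.numeric(unlist(toy_toks_sent))

# Average emoji sentiment per comment; comments without emoji get NA
mean_scores <- sapply(toy_toks_sent, function(x) {
  if (length(x) == 0) NA_real_ else mean(as.numeric(x))
})
mean_scores
```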